Applications and Accuracy 

of 

the Parallel Diagonal Dominant Algorithm * 

Xian-He Sun 
ICASE 

Mail Stop 132C 

NASA Langley Research Center 
Hampton, VA 23681-0001 
(804) 864-8018, sun@icase.edu 


ABSTRACT 

The Parallel Diagonal Dominant (FDD) algorithm is a highly efficient, ideally scalable tridiago- 
nal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm 
is introduced. Then the algorithm is extended to solve periodic tridiagonal systems. A variant, 
the reduced PDD algorithm, is also proposed. Accuracy analysis is provided for a class of tridi- 
agonal systems, the symmetric and anti-symmetric Toeplitz tridiagonal systems. Implementation 
results show that the analysis gives a good bound on the relative error, and the algorithm is a good 
candidate for the emerging massively parallel machines. 
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1 Introduction 


Solving tridiagonal systems is one of the key issues in computational fluid dynamics (CFD) and 
many other scientific applications [21, 13]. Many methods used for the solution of partial differential 
equations (PDEs) rely on solving a sequence of tridiagonal systems. The alternating direction 
implicit (ADI) method, the most widely used implicit method for PDEs [17], solves PDEs by solving 
tridiagonal systems alternately in each coordinate direction. Discretization of partial differential 
equations by compact difference schemes also leads to a sequence of tridiagonal systems. Tridiagonal 
systems also arise in multigrid methods and in ADI or line-SOR preconditioners for conjugate 
gradient methods. In addition to solving PDE’s, tridiagonal systems also arise in many other 
applications [1]. 

Solving tridiagonal systems is inexpensive on sequential machines. However, because of their 
serial nature, tridiagonal systems are difficult to solve efficiently on parallel computers. Thus 
intensive research has been done on the development of efficient parallel tridiagonal solvers. Many 
algorithms have been proposed [14, 8]. Among them, the recursive doubling reduction method 
(RCD), developed by Stone [16], and the cyclic reduction or odd-even reduction method (OER), 
developed by Hockney [9], are able to solve an n-dimensional tridiagonal system in 0(log n) time 
using n processors. These are effective algorithms for fine grained computing. Later, several 
algorithms were proposed for median and coarse grain computing, i.e. for the case of p < n or 
p << n, where p is the number of processors available [5, 11, 22]. The algorithm given by Lawrie 
and Sameh [11] and the algorithm given by Wang [22] can be considered substructured methods. 
These algorithms partition the original problem into sub-problems. The sub-problems are solved in 
parallel, and then the solutions of the sub-problems are combined to obtain the final solution. All 
of these parallel tridiagonal solvers increase parallelism by adding additional computation. They 
trade increased work for reduced communication overhead and better load balance and have a 
larger operation count than the best sequential algorithm. 

Recently, Sun, Zhang, and Ni [21] have proposed three parallel algorithms for solving tridiagonal 
systems. All of these algorithms are based on Sherman- Morrison matrix modification formula [3]. 
The parallel partition LU (PPT) algorithm and the parallel hybrid (PPH) algorithm are fast and 
able to incorporate limited pivoting. The PPT algorithm is a good candidate when the number of 
processors, p, is small. The PPH algorithm is a better choice when p is large. Finally, for diagonal 
dominant problems, the (PPD) algorithm is the most efficient. 

Compared with other tridiagonal solvers, which all have at least O(log p) communication cost, 
the PDD algorithm has only a small fixed communication cost and a small amount of additional 
computation. In fact, the PDD algorithm is perfectly scalable, in the the sense that the communi- 
cation cost and the computation overhead do not increase with the problem size or with the number 
of processors available. 
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Modern technological advances have made it possible to build computers containing more and 
more processors. Multiprocessors with hundreds or thousands of processors are commercially avail- 
able. Recent parallel computers, such as the Intel Paragon, Thinking Machine Corporations’s 
CM-5, and Cray’s MPP, highlight the use of high-density and high-speed processor and memory 
chips based on ultra-large-scale integration (ULSI) or very high-speed integrated circuits (VHSICs). 
With this new technology, 64-bit 150-MHz microprocessors, for example, are now available on a 
single chip having 1.6-million transistors [10]. The emerging parallel computers build on such mi- 
croprocessors are noted for their scalable architecture and massively parallel processing. They are 
designed for grand challenge applications which could not otherwise be tackled. 

Scalability has become an important metric of parallel algorithms [6, 20, 19]. Its perfect scala- 
bility and high efficiency make the PDD algorithm, when applicable, an ideal choice on these new 
machines. However, the PDD algorithm is relatively new and applicable only under certain condi- 
tions. In this paper we give a detailed study of the PDD algorithm. We study the applications of 
the PDD algorithm and provide a formal accuracy analysis for Toeplitz tridiagonal systems. The 
PDD algorithm proposed in this paper is slightly different from the algorithm proposed in [21]. 
Extended study is provided for different applications, such as periodic systems, and systems with 
multiple right-sides. The reduced PDD algorithm is also proposed. Simple formulas are provided 
for accuracy checking for symmetric and anti-symmetric Toeplitz tridiagonal systems. 

This paper is organized as follows. Section 2 will introduce the sequential and parallel PDD 
algorithms. The applications of the PDD algorithm will be discussed in Section 3. This section will 
also give the variant of the PDD algorithm for periodic systems and the reduced PDD algorithm. 
Section 4 will give the accuracy study for the PDD algorithm and the reduced PDD algorithm. 
Experimental results on an Intel/860 multicomputer will be presented in Section 5. Finally, Section 
6 gives the conclusion and final remarks. 

2 The Parallel Diagonal Dominant Algorithm 

We are interested in solving a tridiagonal linear system of equations 

Ax = d (1) 


where A is a tridiagonal matrix of order n 


A = 


i>o c o 
a\ b\ c\ 


[ai,bi,Ci] 


&n — 2 ^n — 2 ^n— 2 

n n — i b n —\ 


( 2 ) 
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x — (^or's^n-i) 7 and d — (do, - ■ ■ ,d n -i) T . We assume that A,x, and d have real coefficients. 
Extension to the complex case is straightforward. 

2.1 The Matrix Partition Technique 

To solve Eq. (1) efficiently on parallel computers, we partition A into submatrices. For convenience 
we assume that n = pm. The matrix A in Eq. (2) can be written as 

A = A + AA, (3) 


where 
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The submatrices A i(i = 0, • • - ,p— 1) are mxm tridiagonal matrices. Let e, be a column vector 
with its ith (0 < * < n — 1) element being one and all the other entries being zero. We have 


"771 — l 


AA — c m— 1 5 c 2m-l e 2m-l > * * ‘ > c (p-l)m — l € (p-l) 771 — l] 


= VE j 


^(p-l)m-l 

e (p-l)m J 


(4) 


where both V and E are n x 2(p — 1) matrices. Thus, we have 


A = A + VE 1 . 


(5) 


Based on the matrix modification formula originally defined by Sherman and Morrison [15] for 
rank-one changes and generalized by Woodbury [23], and assuming that all A{ y s are invertible, Eq. 
(1) can be solved by 


x = A -1 d = (A + VE T ) -1 d 

= A -1 d — A -1 V(I + E T A -1 V) -1 E T A -1 d. 


x 
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( 6 ) 

(7) 



Note that / is an identity matrix and Z = I + E T A 1 V is a pentadiagonal matrix of order 2 (p- 1). 
Let 


Ax = d (8) 

AY = V (9) 

h = E t x (10) 

Z = I + E t Y (11) 

Zy = h (12) 

Ax — Yy . (13) 


Equation (7) becomes 


x = x — Ax. 


(14) 


In Eq.s (8) and (9), x and Y are solved by the LU decomposition method. By the structure of 
A and V , this is equivalent to solve 


/l t [x (,) ,u (,) ,u; w ] = [d (,) ,a tm e 0 ,C( t+1 ) m _ 1 e m _ 1 ], (15) 

i = 0, • • • ,p— 1. Here x^ and d O are the ith block of x and d, respectively, and v^'\ w are possible 
nonzero column vectors of the ith row block of Y . Equation (15) implies that we only need to solve 
three linear systems of order m with the same LU decomposition for each i (i = 0, 1). 

In addition, we can skip the first m — 1 forward substitutions for the third system since the first 
m — 1 components of the vector at the right-hand side are all zeros. There is no computation or 
communication involved in computing h and Z. 


2.2 The PDD Algorithm 


Solving Eq. (12) is the major computation involved in the conquer part of our algorithms. Different 
approaches have been proposed for solving Eq.(12), which results different algorithms for solving 
tridiagonal systems [21]. The matrix Z in Eq. (12) has the form 


Z = 


w. 
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V {1) 
V m - 1 


( 0 ) 
m — 1 
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w. 


m— 1 
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W. 


>-D 


(P-2) 

0 

(P-2) 

m — 1 


(16) 


where for i = 0 , 1 are solutions of Eq. (15) and the l’s come from the identity 

matrix I. In practice, especially for a diagonal dominant tridiagonal system, the magnitude of the 
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last component of v^_ lf and the first component of w^\ may be smaller than machine 
accuracy when p << n. In this case, and vj^_ 1 can be dropped, and Z becomes a diagonal block 
system consisting of (p - 1) 2 x 2 independent blocks. Thus, Eq.(12) can be solved efficiently on 
parallel computers, which leads to the highly efficient parallel diagonal dominant (PDD) algorithm. 

In the sequential PDD algorithm, since Y has at most two nonzero entries in every row, and Z 
is a diagonal block matrix with l’s as diagonal elements, (12) takes five arithmetic operations per 
row, and the evaluation of (13) takes four operations per row. Based on the above observations, 
and together with a careful scaling process, we conclude that the sequential PDD algorithm takes 
17n — 9^ — 4 p — 9 arithmetic operations. 

Using p processors, the PDD algorithm consists of the following steps: 

Step 1 . Allocate Ai,d( x \ and elements C( 1+1 ) m _ a to the ith node, where 0 < i < p — 1 . 

Step 2. Solve (15). All computations can be executed in parallel on p processors. 

Step 3. Send x^\ from the ith node to the ( i - l)th node, for i = 1, • • • - 1. 


Step 4. Solve 


1 1 / Vii \ = f \ 

v { 0 +1) 1 J \ 2/2, +1 ) { 4 ’ +1) ) 


(17) 


in parallel on the ith node for 0 < i < p — 2. Then send , from the ith node to the (i + l)th 
node, for i = 0, • • - ,p - 2. 


Step 5. Compute (13) and (14). We have 


= [«<*•),«,(*•)] ^ j 

x = x ^ 

In all of these, one has only two neighboring communications. 

Communication cost is an overhead of parallelism. Recent advanced communication mecha- 
nisms, such as circuit switching and wormhole routing, have reduced communication delay con- 
siderably. However, compared with the improvement of processing speed, the improvement of 
communication speed is relatively small. Communication cost has a great impact on overall per- 
formance. Empirically, for most distributed-memory computers, the communication time for a 
neighboring communication is a linear function of the problem size [4]. Let S be the number of 
bytes to be transferred. Then the transfer time of a neighboring communication can be expressed 
as a + 5/3, where a represents a fixed startup overhead and /3 is the incremental transmission time 
per byte. Assuming 4 bytes are used for each real number, Step 3 and Step 4 take a + 8/3 and a + 4/3 


(18) 

(19) 
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communication respectively. The parallel PDD algorithm needs 17£ - 4 parallel computation and 
2(a + 6/3) communication. 

2.3 Scalability Analysis 

As parallel machines have been built with more and more processors, the performance metric 
scalability becomes more and more important. Thus, the question is how an algorithm will perform 
when the problem size is scaled up linearly with the number of processors. Let T(p, W) be the 
execution time for solving a system with W work (problem size) on p processors. The ideal situation 
would be when both the number of processors and the amount of work are scaled up N times, the 
execution time remains unchanged: 

T(N x p,N x W) — T(p , W ) (20) 

How one should define problem size, in general, is a style under debate. However, it is commonly 
agreed that the floating point (flop) operation count is a good estimate of problem size for scientific 
computations. To eliminate the effect of numerical inefficiencies in parallel algorithms, in practice 
the flop count is based upon some practical optimal sequential algorithm. In our case, the LU 
decomposition has chosen as the sequential algorithm. It takes 8 n 7 floating point operations, 
where 7 is a negligible constant number when n is large. As the problem size W increases N times 
to W' , we have 

W = N x 8n = 8n' ^ 

n' = N • n. 

Let T comp represent the unit of a computation operation normalized to the communication time. 
The time required to solve (1) by the PDD algorithm with p processors is 

T(p, W) = (17^ - 4 )T comp + 2(a + 6/3), (22) 

and 

T(N xp,N xW) = (17^ - 4)r comp + 2 (a + 6/3) 

= (17&F - 4)r c omp + 2(a + 6/3) (23) 

= (17* - 4)r comp + 2(a + 6j3) 

= T(p,W ). 

The PDD algorithm has the ideal scalability. Similar arguments could be applied to periodic 
systems (see Section 3) and the same result would be obtained. 

Using the isospeed approach, scalability has been formally defined in [20]. The average unit 
speed is defined as the quotient of the achieved speed of the given computing system and the number 
of processors. Since Eq.(20) is true if and only if the average unit speed of the given computing 
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system is a constant, the scalability is defined as the ability to maintain the average unit speed 
[20]. Let W be the amount of work of an algorithm when p processors are employed in a machine, 
and let W { be the amount of work of the algorithm when p f = N • p processors are employed to 
maintain the average speed, then the scalability from system size p to system size N • p of the 
algorithm-machine combination is defined as 


*l>(p,N x p) = 


N -p-W 
P‘W' 


N -W 
W ' 


(24) 


The average unit speed can be represented as 


AS(p,W) 


w 

p-T(p,wy 


(25) 


where W is the problem size, p is the number of processors, and T(p,W) is the corresponding 
execution time. From our early discussion, for the PDD algorithm, when W' = N ■ W, we have 
T(N x p,W') = T(p, W). Therefore 


A_5(iV x p, W') = 


W’ 


W' 


N -W 


W 


N • T(N x p, W') N ■ T(p,W) N • T(p, W) T(p,W)' 


(26) 


That is W' = N • W has maintained the average unit speed, and the scalability is 


V>(p, N x p) = 


N -W 
W 1 


N -W 
N ■ W 


(27) 


It is the ideal scalability. 


3 Special Applications 

In this section, we first discuss some tridiagonal systems arising in CFD applications, the symmetric 
and anti-symmetric Toeplitz tridiagonal systems. Then two variants of the PDD algorithm, the 
reduced PDD algorithm and the PDD algorithm for periodic systems, will be presented. 


3.1 Toeplitz Tridiagonal Systems 


A Toeplitz tridiagonal matrix has the form 


b c 
a b c 

. . c 

a b 


[a, b, c]. 


(28) 
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Symmetric Toeplitz tridiagonal systems are often arise in solving partial differential equations and 
in other scientific applications. Compact finite difference scheme [12] is a relative new scheme for 
solving PDE’s. Because of its simplicity and high accuracy, it has been widely used in practice. 
Using the compact scheme, the general approximation of a first derivative has the form: 


Pfi - 2 + a f'i- 1 +Ji+ «/i+ 1 + Pf'i + 2 = C 


ft +3 ~ ft— 3 . ,fi + 2 - fi- 


6h 


+ b- 


Ah 


+ CL 


fi+l ~ fi-1 

2 h 


Letting 

Q ~ \'P ~ 0,a = Y ,b = £’ c = °’ ( 29 ) 

the scheme becomes formally sixth order accurate and the resulting system is [5, 1, 5], a symmetric 
Toeplitz tridiagonal system. Similarly, the general approximation of a second derivative has the 
form 


at" . _ t" , t" , _ t" , at" _/i +3 - 2/i + /i_ 3 L fi+2 ~ %fi + fi-2 , _ fi+l ~ %fi + fi-1 

Pfi - 2 + «/t- 1 + fi + «/«+! + Pfi+2 = c — + ° TTo + a - 


9h 2 


Ah? 


h? 


For 

a = yp/3 = 0,a = = ^j-,c = 0, (30) 

a sixth order difference scheme is obtained, and the tridiagonal system is symmetric and Toeplitz, 
[n,l>n]- Discretized in time, the one dimensional wave equation u t = a- u x and the heat equation 
Ui — a • u xx can be represented as 


/"+ 1 = u n + At-o-u", 


(31) 


and 


i n+1 = u n + At ■ a • u” 


(32) 


respectively. 

Using the compact scheme, u” and u xx are defined by symmetric Toeplitz tridiagonal systems. 
Therefore, the solutions can be obtained by solving a sequence of symmetric Toeplitz tridiagonal 
systems. Using ADI methods [17], parabolic and hyperbolic systems can be solved by solving a 
sequence of symmetric Toeplitz tridiagonal systems. 

Anti-symmetric Toeplitz tridiagonal systems also arise in solving PDEs [17]. For instance, to 
solve the wave equation u t + a • u x — /, we begin with the formula 


u t = 


u(t + k, x) — u(t, X ) 


+ 0(k 2 ) 


(33) 
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for ut evaluated at (i + We also use the relation 


».(< + £*,*) =^±^^ + o(k 2 ) 

_ 1 | u (t+k,x+h)-u{t+k,x-h) t u(t,x+M-^(t.^-Mj f 0(k 2 ) + Q(h 2 ) 


Using these approximations for ut + a • u x = / about (t + | k,x ), we obtain 


, ca - gt\ + <±1 - fj?i ± & 

k +a 4 h 2 


( 34 ) 


(35) 


or, equivalently, 


a \,n+l , „n+l aA ? ,n+l _ _±l ? ,n . ».» , ^ + l(/n+l + fn) 

J» ra+ 1 + V m _ "^m-l - 4 *Wl + + 4 U m-1 + 2 Um J m ’ 

The left side is an antisymmetric Toeplitz tridiagonal matrix , A = 


a\ 


a\ 


aA 


(36) 


3.2 Periodic Tridiagonal Systems 

Many PDE’s arisen in real applications have periodic boundary conditions. For instance, to study 
a physical phenomenon of a large object, we often simulate only a small portion of it and then 
apply periodic boundary conditions on each of the portions. The resulting linear systems have the 
form of 


b 0 c 0 °o 

ai 61 Ci 


A = 


(37) 


C n — 1 


Ctn — 2 ^n— 2 ^n — 2 

An- 1 ^n— 1 


and are called periodic tridiagonal systems. On sequential machine, periodic tridiagonal systems 
are solved by combining the solutions of two different right-sides [7], which increases the operation 


count from 8n — 7 to 14n — 16. 

The PDD algorithm can be extended to periodic tridiagonal systems. The difference is that, 
after dropping tuj 0 , and »£l lt the matrix Z becomes a periodic system of order 2 p: 


Z = 


1 

,(i) 


1 0 
0 1 


w 


(p- 1) 

TO — 1 




v 


( p - i) 
0 


w 


(P-2) 

m — 1 


l 


(38) 
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The dimension of Z is slightly higher than in the non-periodic case, which simply makes the load 
on the Oth and (p-l)th processor identical to load on all of the other processors. The parallel 
computation time remains the same. For periodic systems, the communication at step 3 and 4 
changes from one dimensional array communication to ring communication. The communication 
time is also unchanged. Figure 1 shows the communication pattern of the PDD algorithm for 
periodic systems. 



Figure 1. Communication Pattern for Solving Periodic Systems. 


3.3 The Reduced PDD Algorithm 

In the last step, Step 5, of the PDD algorithm, the final solution, x , is computed by combining the 
intermediate results concurrently on each processor, 

z (,) = z (,) - J/2(i-i)W (,) - V2i u> (,) , (39) 


which requires 4 (n — 1) operations in total and 4m parallel operations, if p = njm processors are 
used. The PDD algorithm drops off the the first element of w,w 0 , and the last element of v, u m _ x , 
in solving Eq. (12). In Section 4.1 - 4.2, we will show that, for symmetric and anti-symmetric 
Toeplitz tridiagonal systems, the wq and u m _ x can be dropped when m is large with the accuracy 
of the final solution unaffected. Further more, we have 


-i m — 1 77i — i 

— i — AY b2i i Y & 2 7( -&),•••,( -hr- 1 ) 7 - 

H a + b^o 1 b2t r h h 


771 — 1 


(40) 


So, when m is large enough, we may drop off i = y , • ♦ m — 1, and uj t -, i = 0, 1, • • • , y — 1, while 


maintaining the same accuracy. If we replace by {?,*, where V{ = V{ for i = 0, 1, - 


’ 2 


- 1 , Vi = 0 , 


for i — y , • * * , m — 1; and replace w by w, where w x — w t for i ~ 
i = 0, !,*•*, y — 1; and use w in step 5, we have 


2 ’ 


• , m — 1, and w x ~ 0, for 
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Step 5’ 


Ax^ = [v, u>] 


J/2(i-l) 

2 / 2 , 


x^ = x^) — Ax^. 


(41) 

(42) 


It requires 2^ parallel operation. Replacing Step 5 of the PDD algorithm by Step 5’, we get the 
reduced PDD algorithm which requires 15^ — 4 parallel computations. 


4 Accuracy Analysis 

The PDD algorithm is highly efficient, perfectly scalable, but it is only applicable when the inter- 
mediate results vj„_ 1 ,UJo , 0 < i < p — 1, can be dropped out. However this dropping may lead to 
inaccurate or even wrong solution. Thus an accuracy study is essential for applying the PDD al- 
gorithm. Some study have been done for the accuracy of the PDD algorithm. Sufficient conditions 
have been given [24, 2]. However, the study is for the general case. The conditions given in [24] 
are difficult to verify and the accuracy bound is large. In this section we focus on a particular class 
of tridiagonal systems, symmetric and anti- symmetric Toeplitz tridiagonal systems. Our analysis is 
four fold. First, we give the decay rate of i = 0, • • • ,p — 1. They are the entries treated 

as zeros by the PDD algorithm. Second, the accuracy of the PDD algorithm is studied. Then, 
we analyze the accuracy of the reduced PDD algorithm. All of the above three analysis are for 
symmetric Toeplitz tridiagonal systems. Finally, we extend the results to anti-symmetric Toeplitz 
tridiagonal systems. 

4.1 The Decay Rate of v m _i and wo 

Symmetric Toeplitz tridiagonal systems have the form A = [A,/?, A] = A[l,c, 1], where c = /3/A. We 
assume the matrix A is diagonal dominant. That is |c| > 2. To study the accuracy of the solution 
of Ax = 6, we first study the matrix 
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fa 1 
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i 

c 1 
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B = 


1 . 




6 . 
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1 c j 



b 1 ) 


1 

a / 

where a and b are the real 

solutions of 









O- 
+ 
a 
1 1 

= c, 

b ■ a — 1. 




Since a • b = 1 and |c| > 2, we may further assume that |a| > 1, and |6| < 1. 
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The LDL t decomposition of B is 

B = [6, 1,0] x [0,a,0] x [0, 1,6]. 


Thus 


B x = [0,1,6] 1 x [0, a, 0] 1 x [6, 1, 0] — 1 


1 -b b 2 . (-6) n_1 \ 


a 1 \ 

/ 

1 -b . (— 6) n ~ 2 


a- 1 


1 -b 




1 ) 


\ a" 1 J 

V 


-6 1 

b 2 -b 1 


V (- b ) 


71—1 


Let d = (1,0, • • 0) T , then 

71—1 71—1 




Let 


AB = 


o 

•11 

i=i 


i=2 


b 

0 . . 

■ 0 ^ 


b \ 

0 

0 . . 

• • 


0 

\o 

0 . . 

• oj 


1 0 / 


t — 71 — 1 


(l,0,---,0) = ve t , 


and 


B = B 4- AB = [l,c,l] 

Then, by the matrix modification formula (7), the solution of By = d is 

y = B~ l d= (B + VE^-'d 

= B~'d- B~ l V{I + E T B- l V)~ l E T B-'d 


where 


(I + E t B~ 1 V)- 1 = 


« + o b2i ' 

V^n-1 1 2» 

E T B~ 1 d= ld=o_0_ 


71—1 71—1 


71 — 1 


= £ e*n-br- i f. 


t—O i=l 


1 — 71—1 


\ 

6 1 J 

(44) 

(45) 

(46) 
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B~ l V{I + E T E T B _1 d = - ■ 


b Hi=o g 
« + 6E”=“o6 2 * 


( E5 o 1 * 2 ’ \ 


v (-^r 1 ) 


The last element of ?/ is 


J/n— 1 


Mr 1 

a 

Mr 1 

a 


(- 6 ) 


n— 1 


frE ?= ofr 2t (-*)"" 1 


a a + ^EEo ^ 2 ' 

_J ^ 

n — 1 


,l + 6 2 E?Jo b 2 ' 


« \a + bU=ob 2 \ 
( note a ■ b - 1). 


(47) 

(48) 


Thus: 


The first element of y is 


li/n-ll < 


\b\ 


n — 1 


M 


E"=o b 2 ' 


yo 


|yo| = 


1 \ _ 6(1 - 6 2n ) 

1 + 6 2 b 2i ) ~ 1 - 6 2 ( n + J ) 

6(1 - 6 2n ) 


<| 6 |. 


| 1 _ &2(n+l) 

For the original system Ax = d,A = A[l, c, 1], the first element of x is 

yo 


Xo = 


The last element of x is 


Zn-l = 


Vn-l 


(49) 


(50) 


(51) 


(52) 


Since for Toeplitz tridiagonal systems, each submatrix >1,, * = 0, • • — 1, has the same structure 

as A, we have the following lemma: 

Lemma 1 If ? where m — n/p , is less than machine accuracy , then v ^_ 1 , i = 0, • • • ,p — 1, can 
be replaced by zero without affecting the accuracy of the final solution of Ax = d. 


With similar arguments, we can prove that for d = (0, • • *0, 1) T , Ax = d has solution 


_ yn-(t+i) 
x x — 


(53) 
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In particular 


£ n _l 

Xo 


- 2flL 

“ A 

__ y to — i 


Combining with Lemma 1, we have: 

Theorem 1 If , m = n/p, ts /ess i/ian machine accuracy , //ien the PDD algorithm gives an 
approximation to the true solution within machine accuracy. 


4.2 Accuracy of the PDD Algorithm 

Theorem 1 says that if v m -i,WQ are less than machine accuracy, the PDD algorithm gives a sat- 
isfactory solution. In most scientific applications, the accuracy requirement is much weaker than 
machine accuracy. We now study how the decay rate of v m _i 7 w 0 influences the accuracy of the 
final solution. Our study starts at the matrix partition formula (7). 

Let 

y = (/ + E T A- 1 V)- 1 E T A- 1 d. (54) 

Substitute y into equation (7), we have 

x = A~ x d - Ar x Vy 

E t x = E T A~ l d - E T A~ l V - y (55) 

= (/ + E T A- 1 V)y - E t A^V . y = y. 

Let y * be the corresponding solution of the PDD algorithm, 

y * = ( I+E t A~ 1 V - D)~ 1 E T A~ l d, 

where D is the 2(p — 1) x 2(p - 1) matrix which contains all the elements. Combined 

with Eq.(54) we have 


(/ + E T A- l V)y - (/ + E T A-' V - D)y* = 0, 


That is 

(y* - y) = (/ + E T A- l V - D)~ X D ■ y. 

Let x* be the corresponding final solution of the PDD algorithm. Then 

x m = A~ x d- A~ l V -y m 
x-x * =A~ 1 V(y*-y) 

= A~ X V(I + E t A~ 1 V - D)~ X D ■ y 
= A~ X V{I + E T A~ X V - D)~ X D ■ E t x. 
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Thus, 


~7j rr 11 < II A-W(I+ E t A~ 1 V - D)~ l DE T \\. 


(56) 


The inequality (56) holds for general tridiagonal systems. In the following we assume the special 
structure of symmetric Toeplitz tridiagonal system to compute the norm of its right side. We use 
the 1-norm in our study. As discussed in the last section, 


(I + E t A~ 1 V - D)~ l 



\ 


(57) 


where Z{ are 2x2 matrices: 






(58) 


For symmetric Toeplitz tridiagonal systems - u^_ x = v ^ - a, and = 6, 

for i = 0, ■ • • , p — 1. So, for our applications, 


Z{ — Z\ — 



(59) 



(60) 


D-E t stretchs D from a 2 (p- 1) x2(p- 1) matrix to a 2(p- 1) x n matrix. Each column of D ■ E T is 
either a zero column or contains only one possible non-zero element, b. (7 + E T A~ 1 V — D)~ 1 DE T 
is a 2(p — 1) x n matrix. Each of its column either is a zero column or contains only two possible 
non-zero elements Ci,C 2 , where 



For our application A,- = Ai, and = c^_j = A, i = 0, • • -,p — 1. So, v W = v, w = w,i — 
— 1 (see Eq. (15)). (A~ 1 V)(I + E T A~ 1 V - D)~ l D ■ E r is an n x 7? matrix, with each 
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column is either zero column or contains only c\w, c 2 v or c 2 w,ciu respectively. Thus, 

|| A~ l V(I + E t A~ 1 V - D)~ 1 DE t || < max{||c 2 u|| + ||c x w||, ||c x u|| + ||c 2 w||} 
< MINI + IdUMI ( note |H| = ||u||, Eq.{ 53)) 

= (N + |cx|)||v|| = ^(1 + |a|)|M| = p^nylHI. 


From our results given in Section 4.1, 

fe(l - 6 2,n ) 


a = 


A(1 - 6 2 ( m+1 >) 


< if I, 1*1 < 


and 


-i(>- ( ess* 62 ‘. **/(-»). • • A-tr-'j 

-5=!sJ5T=>Tis , ^---4-*r- , ) T . 


We have 


Ml = 
< 

< 

< 

< 


1— fc 2 

TT^m — 1 

Z-a=0 

(i+MT 

(_ip( X _t2( "■>-<)) | 

A(a-6 2m + x ) 
1— 6 2 

1— 6 2 | 
k )(i-l*n 

A(a-6 2m + x ) 

i ia+itr 

til 1 

(l-6' 2 )(!- 

I(l-lbl m ) 

M) 

1 Ad j | l_62(m+l) 1 

(1-1*1) 


. i-IM . * 

]Xa] 1 — 16 1 


( note |a| > 1, \b\ < 1) 


- |A|(|a|-l) 

Combining the inequalities (56) and (63) we obtain the final results 

i*i 

|A(1 - |a|)| x (|a| - 1) 


x — x 


< 


S |A 2 (l-|a|)(|a|-l) 

JMT 


( 63 ) 


(64) 


(65) 


( 66 ) 

(67) 


Inequality (66) shows how the values of u m _ x and w 0 influence the accuracy of the final results. 
Inequality (67) gives an error bound of the PDD algorithm. When | < 1, inequality (67) can be 

simplified to 


\x — x 


\x\ 


< 


l*l r 


|A|(|A| - |6|)(|a| - 1) 

4.3 The Accuracy of the Reduced PDD Algorithm 

For the sake of writing, in this section and next section we assume m = n/p is an even integer. Let 
V be the matrix corresponding to V in Eq.(9) such that A _1 V results the vectors v, w (see Section 
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3.3), and let x' be the solution of the reduced PDD algorithm. Then 

x' = A~ l d - A~ l V{I + E T A~ 1 V)E T A- 1 d. (68) 

As in Section 4.2, we let y = (/ + E T A~ 1 V)E T A~ l d. Notice that x* is the solution of the PDD 
algorithm (see Section 4.2). By Eq. (7) and (55), 

x' - x* = {A~'V - A~ l V)y = (A~ l V - A~ 1 V)E T x, 


Therefore, 


1 1 

IMI 


< I \{A~ X V - A _1 F)|| • ||£ T || = WA^V - A~'V || = \\v - u| 

l 


Ma+i>E^o i2 ‘) 


< 

< 

< 


l-fe 2 

l_62(m+0 


v-m-1 

E, i = rs. 1-6 5 


(1 


+ |6| m +p(|fe|?(l-[6|?)) 
(l-6l)(l-|6|) 


m 

6 T 

i-M* 

Ao 

. M 

(1 _i t|"»+i) (1 -|fr|) 


Equation (69) gives the accuracy of the reduced PDD algorithm. 


UfzzgU 

INI 


< 


Ilf - s*ll Ik* - flj 
INI INI 


i&r 

|A| (|A|-|SiSr|)(H-i) 


1**1 

|A|(|o|-l) 


(69) 

(70) 


4.4 Anti-Symmetric Toeplitz Tridiagonal Systems 

The accuracy analysis given by Sections 4.1 - 4.3 is for symmetric Toeplitz tridiagonal systems. In 
this section we extend the results to anti-symmetric Toeplitz tridiagonal systems. We assume that 
m = n/p is an even number. 

An anti-symmetric Toeplitz tridiagonal matrix A has the form A = [- A,/?, A] = A • [-l,c,l]. 
Let B = [— 1, c, 1]. Then, for the corresponding matrix B (see Section 4.1) 


B = [6,1,0] x [0, a, 1] x [0, 1, —6], 


where a, b are the solutions of 

6 + c = c, 6 • a = — 1. (71) 

Comparing with symmetric case, the only difference are —6 in matrix [0, 1, —6] and b-a = — 1 in Eq. 
(71). Following the steps given in the study of symmetric systems, we have computed ihe vectors 
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of v and w in Eq. (15), 


= (sav-D » l .-.(-*r- 1 ) r i 1 




it? = 


M^Er=o(- i ) , ‘ 2 ') 


We can see for anti-symmetric Toeplitz tridiagonal systems = a, and = 

—Wq^ = = b, for i = 0, • • -,p - 1. Thus, the inequality (63) remains true for anti-symmetric 

cases. 

By Eq. (72), we have 


(_ft) m . (i + 6 2) 


Mr -1 

m_1 A(a + 6££ c 1 (-l) i fc 2< ) A(1 + 6 2 ( m + 1 )) ’ 


1*1 < 


|6 m (l -f £> 2 )| 


and 


(*) 

a = v \' = 


|A| 


-6 • (1 — 6 2m ) 

0 “ A(a + 6E^o 1 (-l) , ^‘') ~ A(1 + i> 2 ( m+1 ))’ 
— 6 • (1 — b 2m ) 


a = 




|A(1 + 6 2 ( m +i))| 

For the bound of the norm of vector v (see Eq. (65)). When b ■ a = —1, 


Ml < 


^ 1 l-|6| 2 ( m+1 > ^ 1 

- |Ao(l+62(^+l))| ' IH6] - |A(|a|— 1)| * 


Tl 


The corresponding relative error 


||® — i’ 


1*1 


|A(1 — |a|)(|a| — 1)| 


in terms of a and 6; and 

Ik - *11 


< 


|6| m (l + b 2 ) 


*11 |A 2 (1 - |o|)(|o| - 1)| |A(|A| — 


1 

6| m (l + b 2 ) 

|A(|A| — 

6(1— 6 2m ) 
l+&2(m+l) 

(H-i)l 


(73) 


(74) 
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System 

Matrix 

Best 

sequential 

the PDD 

Computation 

Communication 

Single 

System 

Non-periodic 

8n-7 

17* -4 

2a + 12/3 

Periodic 

14n-16 

17* -4 

2a + 12 (3 

Multiple 

right-side 

Non-periodic 

(5 n - 3) * nl 

(9* + l)*nl 

(2a + 8(3) * nl 

Periodic 

(7 n — 1) * nl 

(95 + 1 )*" 1 

(2a + 8/3) * nl 


Table 1. Computation and Communication Counts of the PDD Algorithm 


in terms of a and b. When 


< 1, we have 


||*- *11 / |6| m (l + b 2 ) 

IN - |A(|A| - |6|)(|«| - 1)| 


For the reduced PDD algorithm, when the system is anti-symmetric, we have 


and 



< | 

\\v — v\\ 



1 1 + 6 2 

| (l+|6| 2 ('"+»)(|6|W2 ( i_| 6r /2)) 1 


A l-)-i,2(m+l) 
|6| m / 2 

| (1+6 2 )(1 — 16|) I 



< 


1 

£>| m (l + b 2 ) 

|A(|A|- 

i>(l— 6 2m ) 

(M - i)l 


| 6| m /2 

|A|(M-1)- 


5 Experimental Results 


Table 1 gives the computation and communication count of the PDD algorithm. Since the tridi- 
agonal systems arising in both ADI and in the compact scheme method are multiple right-side 
systems, the computation and communication count of solving multiple right-side systems is also 
listed in Table 1, where the factorization of matrix A is not considered and nl is the number of 
right-sides. Note for multiple right-side systems, the communication cost increases with the num- 
ber of right-sides. Table 2 gives the computation and communication counts of the reduced PDD 
algorithm. As the PDD algorithm, it has the same parallel computation and communication counts 
for periodic and non-periodic systems. 

A sample matrix is chosen to illustrate and verify the algorithm and theoretical results given 
in previous sections. The sample matrix A is a resulting matrix of the compact scheme, 

A = 1, |]- (75) 


19 




System 

the Reduced PDD 

Computation 

Communication 

Single system 

15* -4 

2a + 12 (5 

Multiple right-side 

(7* + l)*nl 

(2a + 8/9) * nl 


Table 2. Computation and Communication Counts of the Reduced PDD 


For matrix A, 


A = [i i, i] = i • [i, 3 , i] = ^ • ([ b , 1 , 0] x [0, a, 0] x [0, 1 , b] - A B) , 


where A B is given by Eq.(44), and 


, 1 0 3 + y/E L _ 3-\/5 

A- 3 ,c-3,a- 2 ,6- 2 . 


(76) 



Figure 2. Measured and Predicted Decay Rate. 

The PDD algorithm was first implemented on a Xllr4 terminal to solve the corresponding 
periodic system of Ax = d for accuracy checking. Then the algorithm was implemented on a 
32-node Intel/ 860 to measure the speedup over Thomas algorithm [7], a commonly used practical 
sequential algorithm for periodic tridiagonal systems. For accuracy checking, all the measured and 
predicted data have been converted by a logarithm function with base ten to make the difference 
visible. Figure 2 depicts the decay rate of of matrix A, where the x-coordinate is the order 
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of the sub-system A, and the y-coordinate is the value of v m -\. We can see that the theoretical 
bound given in Section 4.1 coincides with the measured value. 



Figure 3. Measured and Predicted Accuracy of the PDD Algorithm. 

Accuracy comparisons of the PDD and the reduced PDD algorithms are given in Fig. 3 and Fig. 
4 respectively. For the accuracy comparisons, the right-side vector, d, was randomly generated. 
The x-coordinate is the order of matrix A, and the y-coordinate is the relative error in the 1-norm. 
These two figures show that our accuracy analysis provides a very good bound. 

Figure 5 and 6 give the speedup of the PDD algorithm over Thomas algorithm. For single 
system, the order of matrix A is limited by the machine memory for n = 6400. For multiple right- 
sides, the system is limited for n = 128 and nl = 4096. From Fig. 5 we can see that the speedup 
of solving a single system increases linearly with the number of processors. Figure 6 shows that 
the linear increasing property does not hold for multiple right-side systems. The lower speedup is 
due to the increase of communication cost. Since the Intel/860 has a very high (communication 
speed)/(computation speed) ratio, we can expected a better speedup on an Intel Paragon or even 
on an Intel/iPSC2 [18] multicomputer. 

6 Conclusion 

A detailed study has been given for the efficient tridiagonal solver, the Parallel Diagonal Dominant 
(PDD) algorithm. The presented PDD algorithm is slightly different from the originally proposed 
version [21] and is also extended to periodic systems. A variant, the reduced PDD algorithm, was 
also introduced. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric 
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Figure 4. Measured and Predicted Accuracy of the Reduced PDD . 
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Figure 5. Measured Speedup Over Thomas Algorithm. 
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Speedup 



Figure 6. Measured Speedup Over Thomas Algorithm, 
4096 Systems of Order 128, Factorization Time Not Included 


and anti-symmetric Toeplitz tridiagonal systems. Implementation results were provided for both 
accuracy analysis and for the proposed algorithm. They showed that the accuracy analysis provides 
a very good theoretical bound and that the algorithm is highly efficient for both single and multiple 
right-side systems. The algorithm is a good candidate for large scale computing, where the number 
of processors and the problem size are large. It is a good choice for the newly emerged massively 
parallel machines, such as Thinking Machine Corporations’s CM-5 and Intel’s Paragon. The discus- 
sion is based on distributed-memory machines. The result can be easily applied to shared-memory 
machines as well. 

The PDD algorithm and the reduced PDD algorithm proposed in this paper can be extended 
to band systems and block tridiagonal systems. The accuracy analysis, which gives a good, simple 
relative error bound, is for symmetric and anti- symmetric Toeplitz tridiagonal systems only. It is 
unlikely that the analysis can be extended for general case with the same technique. 
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